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Abstract The global sensitivity analysis method, used to 
quantify the influence of uncertain input variables on the 
response variability of a numerical model, is applicable to 
deterministic computer code (for which the same set of in- 
" put variables gives always the same output value). This pa- 
per proposes a global sensitivity analysis methodology for 
, stochastic computer code (having a variability induced by 
some uncontrollable variables). The framework of the joint 
modeling of the mean and dispersion of heteroscedastic data 
is used. To deal with the complexity of computer experi- 
ment outputs, non parametric joint models (based on Gen- 
\ eralized Additive Models and Gaussian processes) are dis- 
■ cussed. The relevance of these new models is analyzed in 
' terms of the obtained variance-based sensitivity indices with 
two case studies. Results show that the joint modeling ap- 
proach leads accurate sensitivity index estimations even when 
clear heteroscedasticity is present. 
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1 Introduction 

Many phenomena are modeled by mathematical equations 
which are implemented and solved by complex computer 
code. These computer models often take as inputs a high 
number of numerical variables and physical variables, and 
give several outputs (scalars or functions). For the devel- 
opment of such computer models, its analysis, or its use, 
the globa l Sensitivity Analysis (SA) method is an invalu- 



able t ool dSalteUi et all kood iKleiineni 1 1 9971: iHelton et al 



20061) . It takes into account all the variation ranges of the in- 
puts, and tries to apportion the output uncertainty to the un- 
certainty in the input factors. These techniques, often based 
on the probabilistic framework and Monte-Carlo methods, 
require a lot of simulations. The uncertain input variables 
are modeled by random variables and characterized by their 
probabiUstic density functions. The SA mefliods a re used 
for model calibration ("Kennedv and O'Hagan , 2001 ), model 



validation (Bayarii et al., 2007), decision making process dPe Rocquignv et al, 
20081), i.e. all the processes where it is useful to know which 
variables mostly contribute to output variability . 

The current SA methods are applicable to the determin- 
istic computer code, i.e. for which the same set of input vari- 
ables always gives the same output values. The randomness 
is limited to the model inputs, whereas the model itself is 
deterministic. For example in the nuclear engineering do- 
main, global sensitivity anal ysis tools have bee n applied to 
waste storage safety studies ( Helton et al.,'2006'), and pollu- 
tant transport modeling in the aquifer ( VoUcova et al.. 20081) . 
In such industrial studies, numerical models are often too 
time consuming for applying directly the global SA meth- 
ods. To avoid this problem, one solution consists in replac- 
ing the time consuming computer code by an approximate 



mathematical mo del, called metamodel dSacks et aL , 1989t 



Fang et al. , 20061) . This function must be as representative as 
possible of the computer code, with good prediction capabil- 
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ities and must require a negligible calculation time. Several 
metamodels are classically used: polynomials, splines, n eu- 



ral ne tworks, Gaussian processes dChen et al...,2006;,Fang et al 
2Q06h . 



In this paper, we are not concerned by deterministic com- 
puter models but by stochastic numerical models - i.e. when 
the same input variables set leads to different output values. 
The model itself relies on probabilistic methods (e.g. Monte- 
Carlo) and is therefore intrinsically stochastic because of 
som e "un controllable variables". For the uncertainty anal- 
ysis, [Kleiinen (1997) has raised this question, giving an ex- 
ample concerning a queueing model. In the nuclear engi- 
neering domain, examples are given by Monte-Carlo neu- 
tronic models used to calculate elementary particles trajec- 
tories, Lagrangian stochastic models for simulating a large 
number of particles inside turbulent media (in atmospheric 
or hydraulic environment). In our study, "uncontrollable" 
variables correspond to variables that are known to exist, 
but unobservable, inaccessible or non describable for some 
reasons. It includes the important case in which observable 
vector variables are too complex to be described or synthe- 
sized by a reasonable number of scalar parameters. This last 
situation might concern the code for which some simula- 
tions of complex random processes are used. For example, 
one can quote some partial differential equation resolutions 
in heterogeneous random media simulated by geostatistical 
techn iques (e.g. fluid flows in oil reservoirs. .Zabalza et all 
19981 and acoust ical wave propagation in turbulent fluids. 



looss et al.L 12002 ). where the uncontrollable variable is the 



simulated spatial field involving several thousand scalar val- 
ues for each realization. Of course, in this case, behind this 
uncontrollable variable stands a fully controllable param- 
eter: the random seed. However, the effect of the random 
seed on the computer code output is totally chaotic because 
a slight modification of the random seed leads to a very dif- 
ferent random medium realization. For simplicity and for 
generality, we use the expression "uncontrollable variable" 
in this paper. 

For stochastic computer models, classical metamodels 
(devoted to approximate deterministic computer models) are 
not pertinent anymore. To overcome this probl em, the com 



monly used Gaussi an process (Gp) model ( Sack s et aUll989t 



Marrel et al.L 120081) can include an additive error component 
(called the "nugget effect") by adding a constant term into 
its covariance function. However, it supposes that the error 
term is independent of the input variables (homoscedasticity 
hypothesis), which means that the uncontrollable variable 
does not interact with controllable variables. This hypoth- 
esis limits the use of Gp to specific cases even if recently, 
some authors (e.g. iKleiinen and van Beersl l2005h demon- 
strated the usefulness of Gp for stochastic computer model 
in heteroscedastic cases. 



To construc t heteroscedastic me tamodels for stochastic 
computer code, ^ZabalzaetaL ( 1998 ) have proposed another 
approach by modeling the mean and the dispersion (i.e. the 
variance) of computer code outputs by two interlinked Gen- 
eralized Linear Models (GLMs). This approach, called the 
joint model, has been prev iously studied in the context of ex- 
perim ental data modeling (ISmvthill989l : lMcCullagh and Neldei 
1989I) . 



Modeling the mean and variance of a response variable 
in function of some controllable variables is of primary con- 
cern in product dev elopment and quality engineering meth- 
ods (iPhadkei Il989h : experimentation is used to determine 
the factor levels so that the product is insensitive to po- 
tential variations of environmental conditions. This can be 
summarized, in the framework of the robust design, as the 
optimization of a mean response functi on while minimiz- 
ing a variance function. In this context, Vining and MversI 



1990l) propose to build polyno mial models for the mea n and 



the variance separately, while Lee and Nelden (l2003h con- 
sider the joint GLM approach. A recent and complete re- 
view o n this subject can be found in Bursztyn and Steinberg! 
( 20061) . D ealing with compu ter experiments instead of phys- 
ical ones. Bates et al. I (l2006l) propose different strategies for 
designing and analyzing robust design experiments. In this 
case, the noise factors are fully controllable. This allows the 
authors to provide a powerful stochastic emulator strategy. 



Fo l lowin g the work of lZabalza et al.l (ll998l) . IIooss and Ribatet 
120091) have recently introduced the joint model to 
perform a global sensitivity analysis of a stochastic com- 
puter code. Results show that a total sensitivity index of all 
the uncontrollable variables can be computed using the dis- 
persion component of the joint model. However, the para- 
metric form of the GLM framework provides some limita- 
tions when modeling complex computer code outputs. To 
bypass this hurdle, this paper suggests the use of non para- 
metric models. Due to its similarity with G LMs, General- 

ized Additive Models (GAM) are considered ( Hastie and Tibshiranil 
199d IWood and Augustini hooA even though Gp or other 



non parametric models should also be a relevant solution. 

This paper starts by describing the joint model construc- 
tion, firstly with the GLM, secondly with the GAM. We 
will also show how other models, like Gp, can be used to 
model the mean and dispersion components. The third sec- 
tion describes the global sensitivity analysis for determinis- 
tic models, and its extension to stochastic models using joint 
models. Particular attention is devoted to the calculation of 
variance-based sensitivity indices (the so-called Sobol in- 
dices). Considering a simple analytic function, the perfor- 
mance of the proposed approach is compared to other com- 
monly used models. Next, an application on an actual indus- 
trial case (groundwater radionuclide migration modeling) is 
given. Finally, some conclusions synthesize the contribu- 
tions of this work. 
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2 Joint modeling of mean and dispersion 

2.1 Using the Generalized Linear Models 

The class of GLM allows to extend the class of the tradi- 
tional linear models by the use of: (a) a distribution which 
belongs to the exponential family; (b) and a link function 
which connects the explanatory variables to the explained 
variable ( Nelder and Wedderburg 1972). Let us describe the 
first component of the model concerning the mean: 



\var(}^) 



rii^gifii)^ZjXijlij , 



(1) 



where (}^ ),=i ..„ are independent random variables with mean 
jj-i; Xij are the observations of the variable Xj; jij are the re- 
gression parameters which have to be estimated; Tj,- is the 
mean linear predictor; g{-) is a differentiable monotonous 
function (called the link function); 0,- is the dispersion pa- 
rameter and v(-) is the variance function. To estimate the 
mean component, the functions g{-) and v(-) have to be spec- 
ified. Some examples of link functions are given by the iden- 
tity (traditional linear model), root square, logarithm, and 
inverse functions. Some examples of variance functions are 
given by the constant (traditional linear model), identity and 
square functions. 

Within the joint model framework, the dispersion pa- 
rameter 0/ is not supposed to be constant as in a traditional 
GLM, but is supposed to vary according to the model: 



0/, Ci 

TV,, ( 0,0 , 



(2) 



where d, is a statistic representative of the dispersion, jj are 
the regression parameters which have to be estimated, h{-)is 
the dispersion link function, Q is the dispersion linear pre- 
dictor, T is a constant and v^( ) is the dispersion variance 
function. M,y are the observations of the explanatory variable 
Uj. The variables (Uj) are generally taken among the ex- 
planatory variables of the mean (Xj), but might differ. To 
ensure positivity, h{^) = log is often taken for the disper- 
sion link function while the statistic d representing the dis- 
persion is generally taken to be the deviance contribution - 
which is approximately distributed. Therefore, as the 
distribution is a particular case of the Gamma distribution, 
Vi/(0) = 0" and T ^ 2. 

The joint mo del is fitted using the Extend ed Quasi Log- 
likelihood (EQL, Nelder and Pregibon . 198 7*) maximization. 
The EQL behaves as a log-likelihood for both mean and dis- 
persion parameters. This justifies an iterative procedure to 
fit the joint model. First, a GLM is fitted on the mean; then 
from the estimate of d, another GLM is fitted on the disper- 
sion. From the estimate of (j), weights for the next estimate of 



the GLM on the mean are obtained. This process can be re- 
iterated as many ti mes it is necessary, and allows to entirely 
fit our joint model iMcCullagh and Nelder, 1989). 

Statistical tools available in the GLM fitting are also 
available for each component of the joint model: deviance 
analysis and Student test. It allows to make some variable se- 
lection in order to simplify model expressions. The residuals 
graphical analysis (which have to be normally distributed) 
and the q-q plots can be used as indicators o f the correctness 
of the link function for the mean component (ILee and Nelder 



20031). In practice, some ev idence can lead to an adequate 



choice of the link function (IMcCullagh and Nelden Il989h . 
For example, a binomial-type explained variable leads to the 
use of the logit function. However, if a natural choice is not 
possible and if the identity link function does not provide 
satisfactory residuals analysis, plotting the adjusted depen- 
dent variable versus the linear predic tor might help in choos- 
ing a more appropriate link function ( IMcCullagh and Nelder 
1989). 

In conclusion, all the results obtained on the joint GLM 
are applicable to the problem of stochastic computer ex- 
periments. The novelty proposed in our paper concerns the 
global sensitivity analysis issue (section [32]i. Moreover, in 
the following section we extend the joint GLM to the non 
parametric framework. This kind of model is necessary for 
the computer experiment outputs which tend to be rather 
complex and need non parametric modeling. 
Remark: A simpler approach consists in building p olyno- 



mial models for the mean and the var iance separately Wining and Myers 



199(k Bursztyn and Steinberg 2006). This approach, called 



dual modeling, consists in repeating calculations with the 
same sets of controllable variables (which is not necessary 
in the joint modeling approach). The dual modeling approach 
has been successfully applied in many situations, especially 
for robust conception problems. However for our purpose 
(accurate fitting of the mean and dispersion components), 
it has been shown th at this dual model is less competitive 
than the joint model i Zabalza et al. 199^ Lee and Nelder 



2003b : the dual modeling approach fits the dispersion model 



given the mean model and this approach does not always 
lead to optimal fits. 

2.2 Extension to the Generahzed Additive Models 

Generalized Additive models (GAM) were introduced by 
Hastie and Tibshirani (1990.) and allow a linear term in the 
linear predictor rj = Y^j PjXj of equation ^ to be replaced 
by a sum of smooth functions rj = Y^j^'ji^j)- The 5;(.)'s are 
unspecified functions that are obtained by fitting a smoother 
to the data, in an iterative procedure. GAMs provide a flexi- 
ble method for identifying nonlinear covariate effects in ex- 
ponential family models and other likelihood-based regres- 
sion models. The fitting of GAM introduces an extra level 
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of iteration in which each spUne is fitted in turn assuming 
the others known. GAM terms can be mixed quite generally 
with GLM terms in deriving a model. 

One common choice for sj is the smoothing spline, i.e. 
splines with knots at each distinct value of the variables. In 
regression problems, smoothing splines have to be penalized 
in order to avoid data overfitting. lWood and AugustinI (120021) 
have described in details how GAMs can be constructed us- 
ing penalized regression splines. Because numerical models 
often exhibit strong interactions between input variables, the 
incorporation of multi-dimensional smooth (for example the 
bi-dimensional spline term Sij{Xi,Xj)) is particularly impor- 
tant in our context. 

GAMs are generally fitted using penalized likelihood 
maximization. For this purpose, the likelihood is modified 
by the addition of a penalty for each smooth function, penal- 
izing its "wiggliness". Namely, the penaUzed loghkehhood 
is defined as: 

2 

(3) 

where £ is the loglikelihood function, p is the total number 
of smooth terms and are "smoothing parameters" which 
compromise between goodness of fit and smoothness. 

Estimation of these "smoothing parameters" is gener- 
ally achieved using the GCV score minimization. The GCV 
score is defined as: 




Sgcv 



nd 



{n-DoFf 



(4) 



where n is the number of data, d is the deviance and DoF 
is the effective degrees of freedom, i.e. the trace of the so- 
called "hat" matrix. Extension to (E)QL models is straight- 
forward by substituting the likelihood function and the de- 
viance d for their (extended) quasi counterparts. In practice, 
all the smoothing parameters are jointly updated at each iter- 
ation of the fitting procedure of the joint model. To this aim, 
on every iteration a GLM/GAM is fitted for each trial set of 
smoothing parameters, and GCV scores are only evaluated 
at convergence. 

We have seen that GAMs extend in a natural way GLMs. 
Therefore, it would be interesting to extend the joint GLM 
m odel to a joint GAM one. Such i deas have been proposed 



Rigbv and Stasinopoulos ( 1996h where both the mean and 



m 

variance wer e modeled using semi-parametric additive mod- 
els (Hastie and Tibshirani, 1990). This model is restricted to 
observations following a Gaussian distribution and is called 
Mean and Dispersion Additive Model (MADAM). Our model 
is more general and relaxes the Gaussian assumption as now 
quasi-distributions are considered: while the MADAM fit- 
ting procedure relies on the maximization of the penalized 
likelihood, the joint GAM m aximizes the penahzed extended 
quasi-like lihood. In addition. lRigbv and Stasinopoulosl(ll996l) 



only used cubic regression splines, while our framework al- 
lows also the use of multivariate smoothers - e.g. thin plate 
regression splines. As our model is based on GAMs and 
by analogy with the denomination "joint GLM", we call it 
"joint GAM" in the following. 

Lastly, it has to be noticed that, within the EQL maxi- 
mization framework, a large number of models can be con- 
sidered instead of GAMs. For instance, one can use a GAM 
for the mean response and a GLM for the dispersion com- 
ponent. In addition, more complex models can also be con- 
sidered such as Gaussian processes - see Section l23] 



2.3 Joint modeling with other models 

For some applications, joint GAM could be inadequate, and 
other models can be proposed. For examp le, for Gaussian 
observations. lJuutilainen and RoningI (120051) have used a neu- 
ral network model for mean and dispersion. It is shown to 
be more efficient than joint GLM and joint additive models 
in a context of numerous explanatory variables (25) and of 
a large amount of data (100000). They perform an exten- 
sive comparison for large data sets between joint neural net- 
work model, MADAM, joint local linear regression model 
and joint linear regression model. While our context of com- 
puter experiments is different (we have small data sets), it is 
interesting to recall their conclusion: 

- the neural network joint model gives the best prediction 
performance; 

- MADAM requires a huge amount of memory; 

- joint local linear model is extremely time consuming; 

- joint linear model is appropriate when simplicity is re- 
quired. 

It is also possible to build a heteroscedastic model based 
on the Gaussian p rocess (Gp) metam odel (also known as the 



kriging principle. ISacks et all Il989f) . The Gp approach es 



sentially is a kind of linear interpolation built on the prop- 
erty of the multivariate normal distribution. Gp metamodel 
gives not only a predictor (which is the best linear unbi- 
ased predictor) of a computer experiment but also a local 
indicator of prediction accu racy. For heteroscedastic d ata, a 
first approach, proposed bv iGinsbourger et alj ( |2008b , con- 
sists in modeling the mean of the computer code with a Gp 
metamodel for which the nugget effect is supposed to vary 
with the inputs. From this fitted Gp, one can use the estima- 
tion of the MSE (given by the Gp model) as the dispersion 
statistic d introduced in Equation (|2]l. This model does not 
require any fitting of the dispersion component and we pre- 
fer to focus our attention on another method, the joint Gp 
model, which is coherent with our previous joint models. 
Boukouvalas and CornfordI ( 1200 9) have recently introduced 
a such joint Gp model for the same purpose. 
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The first step of our methodology models the mean by 
a Gp metamodel (having a nugget effect) estimated on the 
learning sample. The second step consists in adjusting a sec- 
ond Gp metamodel on the squared residuals. This process 
can be iterated as in the joint GLM and joint GAM fitting 
procedure. Due to the presence of a nugget effect in the 
mean component, the mean Gp is not anymore an exact in- 
terpolator and the learning sample residuals can be used for 
the dispersion model. However, residuals could also be de- 
rived from a cross validation method. 



3 Global sensitivity analysis 

3.1 Deterministic models 

The global SA methods are applicable to deterministic com- 
puter code, e.g. for which the same set of input variables 
always leads to the same response value. This is considered 
by the following model: 



X 



(5) 



>Y=fiX) 

where /(•) is the model function (possibly analytically un- 
known), X = {Xi ,Xp) are p independent inputs and Y is 
the output. In our problem, X is uncertain and considered as 
a random vector with known distribution which reflects this 
uncertainty. Therefore, Y is also a random variable, whose 
distribution is unknown. In this section, let us recall some 
basic ideas on the variance-based sensitivity indices, called 
Sobol indices, applied on this model. 

Among quantitative methods, variance -based methods 



are the most often used jSaltelli et aUl200Q) . The main idea 



of these methods is to evaluate how the variance of an input 
or a group of inputs contributes into the variance of output. 
We start from the following variance decomposition: 

Var [Y] = Var [E {Y \Xi)] + E [Var {Y\X,)] , (6) 

which is known as the total variance theorem. The first term 
of this equality, named variance of the conditional expecta- 
tion, is a natural indicator of the importance of Xi into the 
variance of Y : the greater the importance of Xi, the greater is 
Var[E(y|X,)]. Most often, this term is divided by Var[F] to 
obtain a sensitivity index in [0, 1]. 

To express the sensitivity indices, we use the unique de- 
composition of any integrable function o n [0, IjP into a sum 
of elementary functions (see for example lSoboll 1 1993h : 



(7) 



f{Xi,--- ,Xp) ^ fo + Y^fiiX,) + Y^f,jiXi,Xj) 

'■ '■<7 
+ ---+/l2..p(^l,--- i^p) I 

where /o is a constant and the other functions verify the fol- 
lowing conditions: 
1 

fii,...,is{xi^,...,xijdxi^ =0 (8) 



L 



\fk = I,. . . ,s , V{/i /j} C { 1 ,...,/?}. If the X/s are mu- 
tually independent, the decomposition (|7]i is valid for any 
distribution functions for the X,s. 

From (|7]), the followi ng decompos ition of the model out- 
put variance is possible (ISobollll993b : 



var[F] = f y,(y) + ty,;(y)+ f v.j,{y) 

i i<j i<i<k 

+ --- + yn..p{Y) , 

where Vi{Y)= Var[E(y , Vij (Y) = Var[E(y \XiXj)] - Vi (Y) 
Vj{Y),. . . One can thus define the sensitivity indices by: 



Var[E(y|X,-)] Vi{Y) 



Var{Y) 

Ml 

Var(F) ' 



Sijk 



Var(y) ' 
Vijk{Y) 



(10) 



Var(y) ' 



These coefficients are called the Sobol indices, and can be 
used for any complex model functions /. The second order 
index Sij expresses sensitivity of the model to the interac- 
tion between the variables Xi and Xj (without the first order 
effects of Xi and Xj), and so on for higher orders effects. The 
interpretation of these indices is natural as their sum is equal 
to one (thanks to equation (|9|l): the larger and close to one 
an index value, the greater is the importance of the variable 
or the group of variables linked to this index. 

For a model with p inputs, the number of Sobol indices 
is 2^ — 1; leading to an intractable number of indices as p 
increases. Thus, to express the overall sensitivity of the out- 
put to an input X-. lHomma and Saltellil d 19961) introduce the 
total sensitivity index: 



St, = S, 



leifi 



(11) 



where #i represents all the "non-ordered" subsets of indices 
containing index /. Thus, Y^iem^i is the sum of all the sensi- 
tivity indices containing / in their index. For example, for a 
model with three input variables, St^ = Si +5i2 + ^13 + 5i23. 
The estimation of these indices can be do ne by Monte 



Carlo simulations or by alternative methods dSaltelh et al. 



2000h . Recent algorithms have also been introduced to re 
duce the number of required model evaluations significantly. 
As explained in the introduction, a powerful method con- 
sists in replacing complex computer models by metamodels 



which have negligible calculation time (e.g. IVoIkova et al. 



20081) . Estimation of Sobol indices by Monte-Carlo tech- 
niques with their confidence intervals (requiring thousand of 
simulations) can then be done using these response surfaces. 



3.2 Stochastic models 

In this work, models containing some intrinsic alea, which is 
described as an uncontrollable random input variable e, are 
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called "stochastic computer models". Let us recall the exam- 
ple proposed in the introduction where e is a random field 
whose each realization is governed by a random seed value. 
We consider the random field e as an uncontrollable vari- 
able when this random field is too complex to be described 
or synthesized by a reasonable number of scalar parameters. 

In the following, the expectation and variance operators 
involve averaging over the distribution of (X,e), unless an- 
other distribution is indicated. Similarly from equation (|5]l, 
consider the following (stochastic) model: 



X H^y = /(X) + v(e,X:e) 



(12) 



where X are the p controllable input variables (independent 
random variables), Y is the output, /(•) is the deterministic 
part of the model function and v(-) is the stochastic part of 
the model function. Let E£(v) = which means that v(-) is 
centered relatively to e: we put inside /(•) a possible con- 
stant term involved by v( ). The notation v(e,X : e) means 
that V depends only on e and on the interactions between 
e and X. The additive form of equation (fT2b is deduced di- 
rectly from the decomposition of the function g into a sum 
of elementary functions depending on (X,e) (like the de- 
composition in Eq. (|7])). 

For a stochastic model ( fTSl i. the joint models introduced 
in section |2] enables us to recover two GLMs, two GAMs or 
two Gps: 



y,„(X)=E(}'|X) = E,(}'|X) 

by the mean component (Eq. ([T])), and 

F,/(X)=Var(}'|X) = Var,(y|X) 



(13) 



(14) 



by the dispersion component (Eq. (|2])). If there is no uncon- 
trollable variable e, it leads to a deterministic model case 
with Fd(X) = Var(y|X) = 0. By using the total variance the- 
orem (Eq. (|6]l), the variance of the output variable Y can be 
decomposed by: 



Var(F) = Varx [E (F|X)] + Ex [Var (F|X)] 
= Varx[F„,(X)]+ExK/(X)] . 



According to model (fT2] i. y„, (X) is the deterministic model 
part, and ^^/(X) is the variance of the stochastic model part: 



F,„(X)=/(X), 

y^(X) = Var, [v(e,X:e)|X] 



(16) 



The variances of F and Y,„{\) are now decomposed ac- 
cording to the contributions of their input variables X. For Y, 
the same decomposition than for deterministic models holds 
(Eq. (|9]l). However, it includes the additional term Ex \Yd (X)] 



(the mean of the dispersion component) deduced from equa- 
tion ( fTsT l. Consequently, 



Var(F)=£y,-(n + LV,;(F)+ Vij,{Y) 

i i<j i< j<k 

+ ...+Vi2..p(l')+Ex[i'rf(X)] , 



(17) 



where Vi{Y) = Varx, [E(y Vij{Y) = Var(x,,x,.) [E(F|X,-Xy)] - 
Vi{Y) — Vj{Y)^ . . . For the mean component }'m(X) that we 
note Ym for easing the notation, we have 



p p p 

Var(F,„)=^\^-(F,„) + ^V,-,-(F„)+ ^ Vijk{Yr,^ 

i i< j i< j<k 

+ ...+Vi2..p(Fm) 



(18) 



By noticing that 

VKFm) = Varx,[Ex(r,„|X,)] = 
= Varx,[Ex.£(y|X,-)] 



Varx,{Ex[E,(F|X)|X,-]} 



(19) 



and from equation (fTOl i. the sensitivity indices for the vari- 
able Y according to the controllable variables X — {Xi)i=\,,,p 
can be computed using: 



S, 



V,{Y,n) 

Var(F) '■ 



Vij{Y,n) 

Var(F) 



(20) 



These Sobol indices can be computed by classical Monte- 
Carlo techniques, the same ones used in the deterministic 
model case. These algorithms are applied on the metamodel 
defined by the mean component F,,, of the joint model. 

Thus, all terms contained in Varx[Fm(X)] of the equation 
( fTSl l have been considered. It remains to estimate Ex [irf (X)] 
by a simple numerical integration of id(X) following the 
distribution of X. Yj^S.) is evaluated with a metamodel, for 
example the dispersion component of the joint model. Ex [i',n (X)] 
includes all the decomposition terms of Var(}') (according to 
X and e) not taken into account in Varx[i'm(X)] i.e. all terms 
involving e. Therefore, the total sensitivity index of £ is 



(15) Sn = 



Ex[y„,(x)] 

Var(y) 



(21) 



As Yj{%) is a positive random variable, positivity of Sj^ is 
guaranteed. In practice, Var(F) can be estimated from the 
data or from simulations of the fitted joint model: 



Var(y) = Varx[F™(X)] +Ex[F™(X)] 



(22) 



If Var(y) is computed from the data, it seems preferable to 
estimate Ex[l'm(X)] with Var(F) - Varx[F„,(X)] to satisfy 
equation ( fTsT l. In our applications, the total variance will be 
estimated using the fitted joint model (Eq. (|22]|). 
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Finally, let us note that we cannot quantitatively distin- 
guish the various contributions in St^ (Se, Sje, Sije, .. .). In- 
deed, it is not possible to combine the functional anova de- 
composition of Fm(X) with the functional anova decomposi- 
tion of Y^i{X) in order to deduce the unknown sensitivity in- 
dices. Finding a way to form some composite indices still re- 
mains an open problem which needs further research. How- 
ever, we argue that the analysis of the terms in the regression 
model Yd and their f-values give useful qualitative informa- 
tion. For example, if an input variable Xi is not present in Y^i, 
we can deduce the following correct information: Sie = 0. 
Moreover, if the f-values analysis and the deviance analysis 
show that an input variable X, has a smaller influence than 
another input variable Xj , we can suppose that the interac- 
tion between Xi and e is less influential than the interaction 
between and e. 

In conclusion, this new approach, based on joint models 
to compute Sobol sensitivity indices, is useful if the follow- 
ing conditions hold: 

- if the computer model contains some uncontrollable vari- 
ables (the model is no more deterministic but stochastic); 

- if a metamodel is needed due to large CPU times of the 
computer model; 

- if some of the uncontrollable variables interact with some 
controllable input ones; 

- if some information about the influence of the interac- 
tions between the uncontrollable variables and the other 
input variables is of interest. 



4 Applications 

4.1 An analytic test case: the Ishigami function 

The proposed method is first illustrated on an artificial an- 
alytical model with 3 input variables, called the Ishigami 
function (iHomma and Saltellii Il996t ISaltelh et aLil200a) : 



Y = f{Xi,X2,X3) = sm{Xi ) + 7 sm(X2)^ + 0. lX^sm(Xi ) 



(23) 



where X,- ~ [— tt; n] for ; = 1,2,3. For this function, all the 
Sobol sensitivity indices (Si, S2, S3, Sn, Sn, S23, Sm, ^r,, 
Stt, Stj,) are known. This function is used in most intercom- 
parison studies of global sensitivity analysis algorithms. In 
our study, the classical problem is altered by considering Xi 
and X2 as the controllable input random variables, and X^ as 
an uncontrollable input random variable. It means that the 
X3 random values are not used in the modeling procedure; 
this variable is considered to be inaccessible. However, sen- 
sitivity indices have the same theoretical values as in the 
standard case. 



For this analytical function case, it is easy to obtain the 
exact mean and dispersion models by deriving (via analyti- 
cal integration) the analytical expressions of the mean com- 
ponent Ym{Xi,X2) and dispersion component Yd{Xi,X2): 

Y,„{XuX2)^E{Y\XuX2) 

= (^l + ^)sin(Xi)+7[sin(X2)]2, 
Yd{XuX2)^Var{Y\XuX2) 

(24) 



4.7.7 Metamodeling 

For the model fitting, 1 000 Monte Carlo samples of [Xi , X2 , ) 
were simulated leading to 1000 observations for Y. No repli- 
cation is made in the {Xi,X2) plane because it has been 
shown that repeating calculations with the same sets of con- 
trollable vanabksj^inefficienHi^^ mode ling ap- 



proach (IZabalza et al.Lll998LlLee and Neldeii 120031) . There 



fore, we argue that it is better to keep all the possible experi- 
ments to optimally cover the input variable space (which can 
be highly dimensional in real problems). In practice, quasi- 
Monte C arlo sequences wil l be preferred to pure Monte Carlo 



samples ( Fang et al. , 2006h . 



In this section, the GLM, GAM and Gp model (with 
their relative joint extensions) are compared (see Table [T]i. 
To compare the predictivity of different metamodels, we use 
the predictivity coefficient Q2, which is the determination 
coefficient computed from a test sample (composed here 
by 10000 randomly chosen points). For each joint model, Q2 
is computed on the mean component. 

The simple GLM is a fourth order polynomial. Only the 
explanatory terms are selected in our regression model us- 
ing analysis of deviance and the Fisher statistics. The Stu- 
dent test on the regression coefficients and residuals graphi- 
cal analysis make it possible to judge the goodness of fit. We 
see that it remains 39% of non explained deviance due to the 
model inadequacy and/or to the uncontrollable variable. The 
mean component of the joint GLM gives the same model as 
the simple GLM. For the dispersion component, using anal- 
ysis of deviance techniques, no significant explanatory vari- 
able was found. Thus, the dispersion component is supposed 
to be constant; and the joint GLM is equivalent to the simple 
GLM approach - but with a different fitting process. 

Studying now the non parametric modeling, we start by 
the simple GAM fitting where we have kept some parametric 
terms by applying a term selection procedure. The predictiv- 
ity coefficient of the mean component of the joint GAM is 
slightly better than the predictivity coefficient of the simple 
GAM. However, the explained deviance given by the joint 
GAM mean component is clearly larger than the one given 
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Table 1 Results for the fitting of different metamodels for the Ishigami function, i'gxpl ^^^^ explained deviance of the model) and Q2 (the 
predictivity coefficient of the model) are expressed in percent. For the joint models, Og^pj and Q2 are those of the mean component F„,. In the 
formulas for GAM, si (■), S2{-) and sji (•) are three spline terms. 





^expl 


Qi 




Formula 


Simple GLM 
Joint GLM 


61.3 
61.3 


60.8 
60.8 


Y = 1.92- 
Y,„ = 1 .92 


f 2.69Xi + 2. 17X| - 0.29Xj' - 0.29X2'' 
+ 2.69Xi +2.nx^- 0.29X^ - 0.29X^ 
log(y^) = f.73 


Simple GAM 
Joint GAM 


76.8 
92.8 


75.1 
75.5 


Y = 

I'm = 


3.76-2.67X, +si(Xi)+j2(X2) 
3.75-3.06X,+i,(Xi)+i2(X2) 
log(y^)=0.59 + .v,(A',) 


Simple Gp 
Joint Gp 




75.0 
75.0 







by the simple GAM approach. Even if this could be related 
to an increasing number of parameters, as the number of 
parameters remains very small compared to the data size 
(1000), it is certainly explained by the fact that GAMs are 
more flexible than GLMs. This demonstrates the efficiency 
of the joint modeling of the mean and dispersion when het- 
eroscedasticity is involved. Indeed, the joint procedure leads 
to appropriate prior weights for the mean component. The 
joint GAM improves both the joint GLM and simple GAM 
approaches: 

(a) due to the GAMs flexibility, the explanatory variable Xi 
is identified to model the dispersion component (the interac- 
tion between Xi and the uncontrollable variable is there- 
fore retrieved); 

(b) the joint GAM explained deviance (93%) for the mean 
component is clearly larger than the simple GAM and joint 
GLM ones (joint GLM: 61%, simple GAM: 77%). 

For the Gp metam odel fitting, we use the methodology 
of iMarrel et alj (120081) which include in the model a lin- 
ear regression part and a Gp defined by a generalized ex- 
ponential covariance. We obtain for the simple Gp the pre- 
dictivity coefficient Q2 — 75.0%, which is extremely close 
to the one of the simple GAM (Q2 = 75.1%). The vari- 
ance of the nugget effect (additional error with constant vari- 
ance) introduced in the Gp model is estimated to 25.9% 
of the total variance, which is close to the expected value 
(1 ^ 62 = 25.0%). We can also fit, at present, a Gp model on 
the squared residuals to obtain a joint Gp model (cf. section 
I2.3I 1. In order to understand which inputs act in the disper- 
sion component, we compute the Sobol sensitivity indices of 
the dispersion component using a Monte Carlo algorithm: 
Syj{Xi) = 0.996 and Syj{X2) = 0.001. These results draw 
the same conclusion than those obtained from the disper- 
sion component equation of the joint GAM: X2 is not an 
explanatory factor for the dispersion. This also leads to the 
right conclusion that only Xi interacts with the uncontrol- 
lable variable X^ in the Ishigami function (|23]) . 

Let us now perform some graphical analyses in order to 
compare the results for the three joint models Joint GLM, 
joint GAM and Joint Gp. Figure [T] shows the observed re- 



sponse against the predicted values for the three models. 
First, the advantage of the GAM and Gp approaches are 
visible in the Figure [T] as the dispersion around the y — x 
line is clearly reduced compared to the joint GLM disper- 
sion. Graphical comparisons between Joint GAM and Joint 
Gp results do not provide any advantage for one particular 
model: similar biases are shown. Second, using the GAM 
model. Figure |2] compares the obtained residuals of a non 
parametric simple model (homoscedastic) with the obtained 
residuals of a non parametric joint model (heteroscedastic). 
The deviance residuals for the mean component of the joint 
GAM seem to be more homogeneously dispersed around the 
X-axis; leading to a better prediction on the whole range of 
the observations. Thus, the joint approach is more compet- 
itive than the simple one. From this simple graphical anal- 
yses, we conclude that a non parametric joint model (GAM 
or Gp) has to be preferred to other models (simple and/or 
parametric). 

In order to make a finer comparison between GLM, GAM 
and Gp models, we examine how well they predict the mean 
Y,n{Xi,X2) at inputs for which we have no data. We can also 
compare the different dispersion models Yij{Xi). The exact 
analytical expressions of Y,„ and Yii are given in Eq. ( l24b . 
Let us remark that we visualize Y^/ versus Xi only because, 
for GLM and GAM dispersion models, there is no depen- 
dence in X2 and, for the Gp dispersion model, there is an 
extremely small X2 -dependence (we then take Z2 = 0). Fig- 
ure [3] plots the theoretical Y,„ and Y^ surfaces (left panels) 
and their estimates derived from the fitted joint GLM, joint 
GAM and Joint Gp models. As shown before, the joint GLM 
is irrelevant for the mean component and for the dispersion 
component. The joint GAM fully reproduces the mean com- 
ponent, while joint Gp gives a rather good approximation, 
but with small noise. Indeed, spline terms of GAM are per- 
fect smoothers while Gp predictor is impacted by residual 
noise on the observations: the nugget effect does not allow 
to suppress all the noise induced by the uncontrollable vari- 
able. For the dispersion component, joint GP and joint GAM 
give result of the same quality: these models correctly re- 
produce the overall behaviour but with small inadequacies, 
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Joint GLM 



Joint GAIVI 



Joint Gp 




5 
Model 



5 
Model 



5 
Model 



Fig. 1 Observed response variable versus the predicted values for the three joint models: Joint GLM, Joint GAM and joint Gp (Ishigami applica- 
tion). 
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Fig. 2 Deviance residuals for the simple and joint GAMs versus the fitted values (Ishigami application). Dashed lines correspond to local polyno- 
mial smoothers. 



probably caused by overfitting problems. For the two dis- 
persion models, fitted observations have been taken from the 
residuals of the mean component learning sample. It would 
be convenient, in a future work, to test another solution by 
taking predicted residuals, for example by applying a cross 
validation procedure. 

We conclude that the joint GAM and joint Gp both ade- 
quately model the stochastic analytical model (the Ishigami 
function (|23]|). We let some fine comparisons between joint 
GAM and joint Gp for another study including a relevant an- 
alytical application. For example, an analytical model with 
strong and high order interactions will probably show the 
superiority of the Gp joint model (because spline high or- 



der interaction terms are difficult to include inside a GAM). 
Therefore, in the industrial application of section 14.21 we 
only use the models based on GLM and GAM, while Gp 
could also be applied. 

4.1.2 Sobol indices 

Table |2] depicts the Sobol sensitivity indices for the joint 
GLM, the joint GAM and joint Gp using equations (l20l l 
and (l2Tl l. The standard deviation estimates (sd) are obtained 
from 100 repetitions of the Monte-Carlo estimation proce- 
dure (which uses 10"* model computations for one index es- 
timation). When this Monte-Carlo procedure is used to es- 
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timate the Sobol index, we report "MC" in the "Method" 
column; while "Eq" (resp. "Sy/') indicates that the sensi- 
tivity indices have been deduced from the joint model re- 
gressive equations (resp. from the sensitivity analysis of the 
dispersion Y^). Therefore, no estimation errors (sd) are as- 
sociated to these indices (except for total indices Sjj which 
can be deduced from Si). When no quantitative deduction on 
the sensitivity index can be made with this process, we have 
put a variation interval which borders the true value. These 
variation intervals are deduced from the elementary relations 
between sensitivity indices (e.g. Si < Sti, Su < St^, etc). 

The joint GLM gives only a good estimation of , while 
^2 and S't-j are badly estimated (errors greater than 30%). 
5i2 is correctly estimated to zero by looking directly at the 
joint GLM mean component formula (see Table [TJ- How- 
ever, some conclusions drawn from the GLM dispersion com- 
ponent formula (which is a constant) are wrong. As no ex- 
planatory variable is involved in this formula, the deduced 
interaction indices are equal to zero: S13 = ^23 = ^123 = 0. 
Thus, ^3 — St^ — 0.366 while the correct values of ^3 and 
Stj are respectively zero and 0.243. 

Contrary to the joint GLM, the joint GAM and joint Gp 
give good approximations of all the Sobol indices. Their 
largest errors concern St^ for the joint GAM (7%-error) and 
joint Gp (16%-error). Moreover, the deductions drawn from 



the model formulas (see Table[T]i are correct (St2 = S2, S\2 = 
S23 — Si23 = 0). The only drawback of this joint model- 
based method is that some indices remain unknown due to 
the non separability of the dispersion component effects. 
However, it can be deduced that S13 is non null due to the 
explicative effect of Xi in the dispersion component. The 
deduced interval variations provide also useful information 
concerning the potential influence of the interactions. 

Table [3] gives the Sobol indices computed by the same 
Monte-Carlo procedure using two classical metamodels as 
the simple GAM and the simple Gp. To estimate the first 
order Sobol indices 5, = y,(y„,)/Var(y) (for / =1,2), the 
metamodel is used to compute V, (F,„) and the observed data 
(the 1000 observations of Y) to compute Var(F). To estimate 
the total sensitivity index Stt, of the uncontrollable variable, 
the metamodel predictivity coefficient Q2 is used. In fact, 
by supposing that the metamodels fit correctly the computer 
code, one deduces that all the unexplained part of these meta- 
models is due to the uncontrollable variable: St^ = I — Q2- 
This is a strong hypothesis, which is verified here due to the 
simplicity of the analytical function. However, it will not 
be satisfied for all application cases: in practical and com- 
plex situations, the Q2 estimation (usually done by a cross- 
validation method) can be difficult and subject to caution. 
For the Ishigami function, 5i , ^2, St^ are correctly estimated. 
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Table 2 Sobol sensitivity indices (with standard deviations) for tlie Ishigami function: exact and estimated values from joint GLM and joint 
GAM. "Method" indicates the estimation method: MC for the Monte-Carlo procedure, Eq for a deduction from the model equations and Sy^ for a 
deduction from the sensitivity analysis of (X) . 



Indices 


Exact 
Values 




Joint GLM 




Joint GAM 




Joint Gp 




Values 


sd 


Method 


Values 


set 


Method 


Values 


sd 


Method 


Si 


0.314 


0.314 


4e-3 


MC 


0.325 


5e-3 


MC 


0.292 


7e-3 


MC 


St 


0.442 


0.318 


5e-3 


MC 


0.414 


5e-3 


MC 


0.417 


7e-3 


MC 


St, 


0.244 


0.366 


2e-3 


MC 


0.261 


2e-3 


MC 


0.205 


le-3 


MC 


Sii 










Eq 







Eq 


0.004 


7e-3 


MC 


S,i 


0.244 







Eq 


]0,0.261] 




Eq 


]0,0.205] 






S21 










Eq 







Eq 









S123 










Eq 







Eq 









Sr, 


0.557 


0.314 


4e-3 


Eq 


10.325,0.586] 




Eq 


]0.292.0.497] 






St, 


0.443 


0.318 


5e-3 


Eq 


0.414 


5e-3 


Eq 


0.417 


7e-3 




S3 





0.366 


2e-3 


Eq 


[0,0.261] 




Eq 


[0.0.205] 







S12 can be deduced from the formula for the simple GAM 
(see Table [U and estimated by Monte-Carlo method for the 
Gp model. However, any other sensitivity indices can be pro- 
posed as no dispersion modeling is involved. 
Remark: Estimating the nugget effect variance of the Gp 
model mean component gives another estimation of the total 
sensitivity index of the uncotrollable variable. In this exam- 
ple, the variance of the nugget effect has been estimated to 
25.9% of the total variance, which is close to the exact value 
(24.4%). However, this estimation can be difficult in more 
complex situations, becaus e of a difficult optimization step 
while fitting the Gp model AFans et aZl \200d - IMarrel et al. 
200^) . 



In conclusion, this example shows that the joint non para- 
metric models can adjust complex heteroscedastic situations 
for which classical metamodels are inadequate. Moreover, 
the joint models offer a theoretical basis to compute effi- 
ciently global sensitivity indices of stochastic models. 

4.1.3 Convergence study 

In order to provide some practical guidance for the sam- 
pling size issue, we perform a convergence study for the es- 
timation of the joint GAM and the associated sensitivity in- 
dices. Figure m shows some convergence results for a learn- 
ing sample size n varying between 30 to 200 by step of 5. 
The learning points are sampled by the simple Monte Carlo 
technique. The predictivity coefficient Q2 is obtained from 
a test sample (composed of 1000 randomly chosen points). 
The total sensitivity index of the uncontrollable variable St^ 
is obtained by averaging the dispersion component (with 
le6 randomly chosen points). We can notice the rapid con- 
vergence of the predictivity coefficient Q2 and the slower 
convergence of £(7^;). The convergence speed for S\ and S2 
computed from the mean component are not shown here but 
are similar to the one of 22- 

From this particular case (low-dimensional but rather 
complex numerical model due to non linearities and strong 
interaction), we conclude that a 100-size sample is sufficient 
for fitting the joint GAM and for obtaining precise sensitiv- 



ity indices. Moreover, for the estimation of the total sensi- 
tivity index of the uncontrollable variable, using the predic- 
tivity coefficient of the mean component is highly recom- 
mended (instead of usin g the dispersion compon ent). With 
additional experiments, llooss and Ribatel (l2009h have con- 
firmed this result. 

In practice, the way to ensure that the convergence has 
been reached would be to visualize Q2 and its confidence in- 
terval (by a bootstrap technique for example) by resampling 
in the learning sample and progressively increasing its size. 



4.2 Application to an hydrogeologic transport code 

The joint approach is now applied to a complex industrial 
model of radioactive pollutants transport in saturated porous 
media using the MARTHE computer code (developed by 
BRGM, France). In the context of an environmental impact 
study, MARTHE has been applied to a model of strontium 
90 (^''Sr) transport in s aturated media for a r adwaste tempo- 



rary storage in Russia (IVolkova et al.L l2008l) . Only a partial 



characterization of the site has been made and, consequently, 
values of the model input variables are not known precisely: 
20 scalar input variables have been considered as random 
variables, each of them associated to a specified probabil- 
ity density function. The model output variables of interest 
concern the ^"Sr concentration values in different spatial lo- 
cations. One of the main goals of this study is to identify the 
most influential variables of the computer code in order to 
improve the characterization of the site in a judicious way. 
Because of large computing times of the MARTHE code, 
the Sobol sensitivity indices are comput ed using m etamod- 
els (boosting regression trees m odel for.Volkova et al. . 20081 



and Gaussian process mo del forlMarrel et al.Ll2008l) 



As a perspective of the lVolkova et al. I (l2008l) work. lloossl 



(120081) studies more precisely the influence of the spatial 
form of an hydrogeologic layer. The method consists in per- 
forming a geostatistical simulation of this layer (which is a 
two-dimensional spatial random field), before each calcula- 
tion of the computer model. This geostatistical simulation 



12 



Table 3 Sobol sensitivity indices (with standard deviations) for the Ishigami function: exact and estimated values from simple GAM and simple 
Gp model. "Method" indicates the estimation method: MC for the Monte-Carlo procedure, Eq for a deduction from the model equations and Q2 
for the deduction of the predictivity coefficient . 



Indices 


Exact 




Simple GAM 




Simple 


Gp 


Values 


Values 


sd 


Method 


Values 


sd 


Method 


Si 


0.314 


0.333 


6e-3 


MC 


0.292 


7e-3 


MC 


Si 


0.442 


0.441 


6e-3 


MC 


0.417 


7e-3 


MC 


St, 


0.244 


0.249 




Q2 


0.250 




Ql 


S12 










Eq 


0.004 


7e-3 


MC 




Sample size Sample size 

Fig. 4 For the Ishigami function, mean and 90%-confidence interval (based on 100 replicates) of joint GAM Q2 and 57-3 in function of the learning 
sample size n. 



is rather complex and the resulting spatial field cannot be 
summarized by a few scalar values. Therefore, as explained 
in our introduction, this hydrogeologic layer form has to 
be considered as an uncontrollable variable of the computer 
model. Additionally to the uncontrollable variable, 16 scalar 
input variables remain uncertain and are treated as random 
variables. It concerns the permeability of different geologi- 
cal layers, the longitudinal and transversal dispersivity coef- 
ficients, the sorption coefficients, the porosity and meteoric 
water infiltration intensities. 



In order to keep coherence with IVolkova et al.l (l2008l) 
previous study, the learning sample size has been chosen 
to be the same: = 300. This size is in adequation with 
the heuristic recommandation of 10 observations per input 



dimension (Loeppkv et al., 2008", Marrel et al. , 2008). used 



in most of the practical studies on deterministic computer 
codes. The Latin Hypercube Sampling method is used to 
obtained a sample of A^ random vectors (each one of di- 
mension 16). In addition, A^ independent realizations of the 
spatial random field (noticed by e) are o btained by a spe- 
cific geostatistical simulation algorithm ( loossl 12008 ). Per- 
forming independent realizations for each of the simulator 
run has been imposed by the small number of available runs 
(300) relatively to the high-dimensional model (20). More- 
over, one of our primary concern was also to perform an 



uncertainty propagation study, in which replicates have to 
be avoided. In any case, more interesting designs should be 
chosen, making replicates for example by changing the con- 
trollable input factors while keeping fixed the geostatistical 
realization. However, su ch ideas are well beyond the scope 
of the current paper (see lAnderson-Cook et all l2009i for a 
recent review about the design issue). 



After 8 calculation days, we obtain 300 observations of 
the output variable of the MARTHE model (^"Sr concentra- 
tion at the domain center). As two computer runs have given 
incoherent values, we keep 298 observations. For the GLMs 
and GAMs construction phase, the large data dispersion sug- 
gests the use of logarithmic link functions for g and h (see 
Eqs ^ and (|2]i). Due to the large number of inputs, a manual 
term selection process has been applied. No interaction term 
has been found to be explicative in the GLMs. However, a 
bi-dimensional spline term has been added in the GAMs be- 
cause of convincing deviance contribution and negligible p- 
value. To find this significant interaction term, we have not 
introduced in the model all the 120 interaction terms. We 
have sequentially tested all the interaction terms involving 
one significant first order term (kdl, kd2, perl and per3) 
and each other factor. Then, we keep the interaction terms 
which show some explanatory contribution to the model. 
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The results are summarized below by giving the explained 
deviance and the explanatory terms involved in the formu- 
las: 

- Simple GLM: Dgxpi 60% with the terms kdl, kd2, 
perl, perl. 

- Joint GLM: Dgxpi(mean) = 66.4%, with the same terms 
than the simple GLM, Dg^pi (dispersion) = 8.7% with 
the terms kdl and per3>. 

- Simple GAM: Dgxpi = 81 .8% with s{kdl), s{kd2), s{per3), 
s{per2,kd2). 

- Joint GAM: D^y^^\{mtw) =98.1% with the same terms 
than the simple GAM, Dg^pi (dispersion) — 29.7% with 
kdl, kdl. 

kdl, kdl and perl, perl, per3 are respectively the sorption 
coefficients and the permeabilities of the different hydrogeo- 
logic layers. One observes that the GAM models outperform 
the GLM ones. The predictivity coefficient (computed by 
the leave-one-out method) of the simple GAM gives Q2 — 
76.4%, while for the simple GLM Q2 = 58.8%. 

Figure |5] shows the deviance residuals against the fitted 
values for the joint GLM, simple GAM and joint GAM mod- 
els. For the joint GLM approach, some outliers are not visi- 
ble to keep the figure readable. As a consequence, the GAMs 
clearly lead to smaller residuals. Moreover, the joint GAM 
outperforms the simple GAM due to the right explanation of 
the dispersion component. It can be seen that the joint GAM 
allows to suppress the bias involved by the heteroscedastic- 
ity, while simple GAM residuals are affected by this bias. 

Figure |6] shows the proportion A of observations that 
lie within the a% theoretical confidence interval against the 
confidence interval a. By definition, if a model is suited for 
both mean and dispersion modelings, the points should be 
located around the y = x line. As a consequence, this plot is 
useful to compare the goodness of fit for the different mod- 
els. It can be seen that the joint GAM is clearly the most 
accurate model. Indeed, all its points are close to the theo- 
retical y — X line, while the joint GLM (resp. simple GAM) 
systematically leads to underestimations (resp. overestima- 
tions). Consequently, from the FigureslsHU one deduces that 
the joint GAM model is the most competitive one. On one 
hand, the mean component is modeled accurately without 
any bias. On the other hand, the dispersion component is 
competitively modeled leading to reliable confidence inter- 
vals. 

Table Ugives the main Sobol sensitivity indices for the 
joint GLM, joint GAM and simple GAM (using lO'* model 
computations for one index estimation). The Sobol indices 
of the interactions between controllable variables are not 
given (except between kdl and perl) because these inter- 
actions are not included in the formulas of the two joint 
models. Therefore, their Sobol indices are zero. The two 
joint models give similar results for all first order sensitiv- 
ity indices. The sorption coefficient of the second layer kdl 



explained more than 52% of the output variance, while the 
permeability of the second layer perl explained more than 
5%. Some large differences arise in the total influence of 
the uncontrollable variable e: 38.2% for the joint GLM and 
27.7% for the joint GAM. Moreover, the joint GLM shows 
an influence of the interaction between per3 and e, while the 
joint GAM shows an influence of the interaction between 
kdl and £. In this application, we consider the joint GAM 
results more reliable than the joint GLM ones because the 
joint GAM captures more efficiently the mean and disper- 
sion components of the data than the joint GLM. 

By comparing the joint GAM results with the simple 
GAM results, some significant differences can be printed 
out: 

- The kdl first order sensitivity index is overestimated us- 
ing the simple GAM (14.0% instead of 3.7% for thejoint 
GAM). Indeed, the deviance analysis of the joint GAM 
dispersion component shows a high contribution of kdl, 
which means that the interaction between kdl and the 
uncontrollable variable is probably large. For a standard 
metamodel, like the simple GAM, this interaction is not 
found out and leads to a wrong estimation of the first 
order sensitivity index of Ml. 

- For the simple metamodels, using the relation ^^(e) = 
I — Q2, the total sensitivity index of the uncontrollable 
variable is underestimated: 23.5% (simple GAM) instead 
of n.1% (joint GAM). The classical metamodels tend to 
explain some parts of the data which can be adequately 
included in the dispersion component of the joint GAM 
during the iterative fitting algorithm. 

- Contrary to the other metamodels, thejoint GAM allows 
to prove that only kdl and kdl interact with the uncon- 
trollable variable. 

As a conclusion, these sensitivity analysis results will be 
very useful to the physicist or the modeling engineer during 
the model construction and calibration steps. In this specific 
application, the sensitivity analysis shows that the geometry 
of the second hydrogeological layer has a strong influence 
(up to 28%) on the predicted ^''Sr concentration. Therefore, 
an accurate modeling of this geometry, coupled with a bet- 
ter knowledge of the most influential variable kdl, are the 
key steps to an important reduction of the model prediction 
uncertainties. 

5 Conclusion 

This paper has proposed a solution to compute variance- 
based sensitivity indices of stochastic computer model out- 
puts. It consists in modeling the mean and the dispersion 
of the code outputs by two explanatory models. The clas- 
sical way is to separately build these models. In this pa- 
per, the use of the joint modeling is preferred. This the- 
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Fig. 5 Deviance residuals (mean component) for the Simple GAM, loint GAM and loint GLM versus the titted values (MARTHE application). 
Dashed lines correspond to local polynomial smoothers. 
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Fig. 6 Proportion A (in percent) of observation that lie within the a% theoretical confidence interval in function of the confidence level a. 
MARTHE application. 



ory, proposed by Pregibon| ( 1984 ) and 



extensively developed by iNeldeil (119981) . is a powerful tool 



SmvthI (Il989 l). then 



to fit the mean and dispersion components simultaneously. 
Zabalza et al. 1 11998') already applied this approach to model 
stochastic computer code. However, the behavior of some 
numerical models can be highly complex and non linear. In 
the present paper, some examples show the limit of this para- 
metric joint model. Being inspired bv Rigbv and Stasinopoulos 
d 19961) who use non parametric joint additive models (re- 
stricted to Gaussian cases), we have developed a more gen- 
eral joint model using GAMs and quasi distributions. Like 
GLMs, GAMs are a suited framework because it allows vari- 
able and model selections via quasi-likelihood function, clas- 
sical statistical tests on coefficients and graphical displays. 
Additional works using joint GLMs and joint GAMs for 
computer experiments can be found in llooss and Ribateti (120091) 

The joint GAM has proven its flexibility to fit complex 
data: we have obtained the same performance for its mean 
and dispersion components as the powerful Gp model. Deal- 



ing with computer codes involving many factors and strong 
interactions between model factors, it would be convenient 
to look more precisely at other joint models, as the joint Gp 
model we have shortly described and used. An analytic case 
on the Ishigami function shows that these two non paramet- 
ric joint models (GAM and Gp) are adapted to complex het- 
eroscedastic situations where classical metamodels are in- 
adequate. Moreover, it offers a theoretical basis to compute 
Sobol sensitivity indices in an efficient way. The analytical 
formulas available with the joint GAM are very useful to 
complete the sensitivity analysis results and to improve our 
model understanding and knowledge. 

The performance of the joint model approach was as- 
sessed on an industrial application. Compared to other meth- 
ods, the modeling of the dispersion component allows to ob- 
tain a robust estimation of the total sensitivity index of the 
uncontrollable variable, which leads to correct estimations 
of the first order indices of the controllable variables. In ad- 
dition, it reveals the influential interactions between the un- 
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Table 4 Estimated Sobol sensitivity indices (with standard deviations obtained by 100 repetitions) for the MARTHE code. "Method" indicates the 
estimation method: MC for the Monte-Carlo procedure, Eq for a deduction from the model equations and Q2 for the deduction of the predictivity 
coefficient Q2. " — " indicates that the value is not available. 



Joint GLM Joint GAM Simple GAM 

Values s(l Method Values s(l Method Values Method 



S(kdl) 


0.002 


0.6e-2 


MC 


0.037 


l.Oe-2 


MC 


0.140 


l.Oe-2 


MC 


S(kd2) 


0.522 


0.6e-2 


MC 


0.524 


l.Oe-2 


MC 


0.550 


l.le-2 


MC 


S(perl) 


0.018 


0.7e-2 


MC 







Eq 







Eq 


S(per2) 


0.052 


0.6e-2 


MC 


0.078 


l.Oe-2 


MC 


0.044 


l.Oe-2 


MC 


S(per3) 







Eq 


0.005 


l.Oe-2 


MC 


0.008 


l.Oe-2 


MC 


S(kd2,per2) 







Eq 


0.063 


l.Oe-2 


MC 


0.026 


l.Oe-2 


MC 


Sr(e) 


0.382 


0.2e-2 


MC 


0.277 


0.3e-2 


MC 


0.235 






S(kdl,e) 


]0,0.382] 




Eq 


]0.0.277] 




Eq 








S(kd2,£) 







Eq 


]0.0.277] 




Eq 








S(perl,£) 







Eq 







Eq 








S(per2,£) 







Eq 







Eq 








S(per3,£) 


]0,0.382] 




Eq 







Eq 









controllable variable and the other input variables. Obtain- 
ing quantitative values for these interaction effects is still 
an open issue, but a challenging problem. Finally, the joint 
model would also serve in the uncertainty propagation stud- 
ies of complex models, to obtain the full distribution of the 
model output. 

In the whole, all statistical analysis were performed us- 
ing th e R software environment Develo pment Core Team , 
2006h . In particular, the following functions and packages 
were useful: the "glm" function to fit a simple GLM, the 
"mgcv" (Multiple Smoothing Parameter Estimation by GCV) 
package to fit a simple GAM, and the "sensitivity" pack- 
age to compute Sobol indices. We also developed the "Joint- 
Modeling" package to fit joint models (including joint GLM 
and joint GAM). 
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